Influenza A viruses are notorious and exemplary pathogens that cross host-species boundaries. By no means exclusively, yet the potential for IAV cross-species emergence is one of the most keenly felt biological threats in the world. The proclivity of these viruses to circulate in multiple vertebrate species, including many of the species reared agriculturally, poses great importance on understanding how and when host-jumps may occur. In addition, huge amounts of resources are spent annually to try and predict whether IAV will jump between species, and what those consequences may entail. To make these predictions, many different angles of influenza virus biology are analysed from evolutionary history to ecological contact networks to structural/molecular modelling of host-pathogen protein interactions to name but a few. However, exploring these features independently can obfuscate patterns and/or lead to pseudo-replication - is he relationship between protein structures already accounted for by viral phylogenetic histories?
Our cross-specialisation approach first uses protein compositional features to train machine learning models on classifying the original host the virus was obtained from. - The features are x and show y, included them because z Then, using the same protein sequences, we constructed mcc trees and reconstructed ancestral node traits using the BEAST suite. Finally, we compared the results of both modelling and phylogenetic approaches to detect any overlap or, more interestingly, any inferences that were exclusive to only one method.
Sequences were collected from NCBI Virus Database. This dataset was filtered to only include samples with: a) complete nucleotide sequences, missing no more than 10% of the average length for that segment b) metadata containing the virus subtype and original host, at least up to the Class level (Aves or Mammalia) through preferably detailing all the way down to the Family or Species level Nucleotide sequences were then assigned a random identifier (between AAA-000-AAA and ZZZ-999-ZZZ) to associate with metadata throughout analyses.
Samples from mammalian hosts were then retained only if the original host was within the Canidae, Suidae, Equidae, Hominidae or Phyllostomidae families, as viruses within these hosts were expected to display clear examples of adaptation. Viral sequences of mammalian-origin were then reduced further by excluding samples where the subtype does not circulate naturally within closed systems of that host (i.e. removing spillovers or stuttering chains of transmission, neither of which represent the kind of host adaptation we sought to detect). Included subtypes for each family: Canidae (H3N2, H3N8), Equidae (H7N7, H3N8) and finally Hominidae (H1N1, H1N2, H2N2 and H3N2).
Corresponding protein sequences were then obtained for each sample, most of which were available from the NCBI Virus Database. The few genomic sequences without matching proteins were translated locally using the ‘ape’ package in R.
Proteins were then aligned using MAFFT into 20 resulting protein fasta files (two for each of the 10 major IAV proteins, one containing mammalian-origin viruses and the other avian-origin). These were then clustered using the linear algorithm ‘easy-linclust’ within the tool MMSeq2. Three variables of the clustering algorithm were experimented with in optimising data downsampling: the percentage identity threshold required to delineate sequences from one another (–min-seq-id 0.5, 0.75, 0.85, 0.9, 0.95, 0.99), coverage of the sequences within each cluster (-c 0.5, 0.7, 0.8) and the way in which the coverage is calculated (–cov-mode 0, 1). Ultimately options were set to –min-seq-id 0.95, -c 0.8 and –cov-mode 1 in order to grant a representative number of samples without introducing redundancy. Ultimately, x proteins were nominated leading to sequences from 4551 individual viruses.
| Protein | 100% | 95% | 85% | 75% | 100% | 95% | 85% | 75% |
|---|---|---|---|---|---|---|---|---|
| 01PB2 | 15507 | 135 | 33 | 18 | 21880 | 2081 | 27 | 17 |
| 02PB1 | 15447 | 77 | 36 | 27 | 21829 | 1623 | 62 | 29 |
| 03PA | 15441 | 2404 | 19 | 11 | 21856 | 1758 | 25 | 12 |
| 04HA | 15380 | 1943 | 46 | 28 | 22106 | 2038 | 32 | 20 |
| 05NP | 14682 | 52 | 8 | 5 | 21865 | 70 | 9 | 6 |
| 06NA | 15442 | 544 | 36 | 25 | 21845 | 294 | 21 | 13 |
| 07M1 | 15335 | 33 | 3 | 3 | 21791 | 130 | 2 | 1 |
| 07M2 | 15124 | 555 | 58 | 14 | 21770 | 291 | 18 | 7 |
| 08NS1 | 15293 | 554 | 115 | 5 | 21737 | 517 | 35 | 5 |
| 08NEP | 15228 | 437 | 29 | 9 | 21760 | 231 | 17 | 2 |
Proteins from avian-origin and mammalian-origin viruses retained by MMSeq2 were then aligned into a single fasta file for each protein. Accessory proteins and those spliced from alternate reading frames (M2, NEP, PB1-F2 and PA-X) were then excluded from any further analyses as they were too short and/or conserved to show patterns of adaptation. OmegaFeaturesCLI was run over the alignments to produce matrices of protein compositional and physio-chemical properties with the ‘get_descriptor’ command. Features, and the associated command, are as follows: 2mers (‘DPC type 1’), Chou’s pseudo-amino acid composition (‘PseAAC’), conjoined triad (‘CTriad’), and finally Composition (CTDc), & Transition (CTDt) & Distribution (CTDd) values.
Six datasets of protein features were derived from the iFeatureOmega processing: amino acid two-mers (DPC, 400 cols), (CTDc, CTDd and CTDt, x y and z cols respectively), paac and triad (X cols). These were then annotated with the subtype of the origin virus and labelled with the original host.
Matrices of protein features were back-referenced to add viral metadata (origin host and subtype). These were then used to create Random Forest classification models, using the features of each of the 8 main IAV proteins to distinguish between the classes of hosts from which viruses were sampled.
In order to create test data, around subtypes that appeared most frequently (>20 times) were iteratively held out of model construction. In addition to these 23 subtypes, both bat-exclusive subtypes (H17N10 and H18N11) were held as test data. The remaining 70 subtypes comprised 308/4551 (~7%) sequences. Six Random Forest models were created from each of the feature datasets, for each holdout. These models were then stacked with the use of the ‘caret’ package. Models were weighted in order to accomodate for disproportionate sizes of host groups by using the inverse of that group’s proportion.
Virus subtypes were used to subset feature datasets, the 25 highlighted subtypes were removed iteratively: first creating six datasets without H1N1 viruses, then six datasets without H1N2 viruses etc.
These datasets were then used to train Random Forest models, leading to 2400 models in total: 8 proteins, 6 chemical and compositional features, 25 subtype holdouts and either two-class or multi-class models. Each models’ training statistics were extracted for preliminary validation. Models for each protein and subtype were then compiled with the Stacking algorithm by caret.
| Protein | Model | Node |
|---|---|---|
| PB2 | FLU+R4 | x |
| PB1 | FLU+R4 | x |
| PA | FLU+R5 | x |
| HA | FLU+R8 | x |
| NP | Q.mammal+R4 | x |
| NA | FLU+R5 | x |
| M1 | Q.mammal+R3 | x |
| NS1 | HIVw+R6 | x |
phylogenetic trees were estimated using the BEAST suite of softwares. Initial model parameters such as evolutionary and codon-partitioning models were estimated by ModelFinder within IQTree2, and temporality was confirmed with TempEst. Maximum Likelihood trees were estimated with substitution models suggested by ModelFinder (Table X) and bootstrapped 1000 times for validation.
Phylogenetic models were first trialled on Maximum Likelihood trees with the use of BayesTraits v3.0; in the R wrapper “btw”. A discrete-trait analysis estimated the origin host of viruses and protein features were modelled as continuous traits. Trait analyses were first modelled using tips features and then ancestral trait reconstruction upon internal nodes, both using multistate reverse-jump MCMC procedures. Initially, transition rates between discrete states (i.e. Host) were examined across each protein tree before the states of ancestral nodes were estimated.
Specific nodes were selected for ancestral trait reconstruction due to known historical host-switching events. Recognised host switch events include: human viruses that originated from avian and swine reassortments in 1918, 1957 and 1968; emergence of swine viruses into human pandemic strains (H1N1pdm09); avian IAV into horses (H7N7 and H3N8) and the jump of equine H3N8 viruses into canine populations.
Tree estimations in BEAST used the following parameters and methods: … Furthermore, in protein features were modelled as continuous traits under ancestral node reconstruction. Likewise, following Seb’s work, protein sequences themselves were estimated at each ancestral node, based on the sequences of each leaf.
| Protein | Avian | Mammalian |
|---|---|---|
| PB2 | 338 | 561 |
| PB1 | 303 | 255 |
| PA | 282 | 415 |
| HA | 1406 | 919 |
| NP | 215 | 244 |
| NA | 185 | 207 |
| M1 | 73 | 149 |
| NS1 | 539 | 606 |
The clustering resulting from removing sequences with a 95% sequence identity left us with 6,697 protein sequences overall, representing 4551 individual virus genomes. To assist with building the model architecture, any virus genome that had at least one representative protein was analysed fully - hence 4551 samples for each of the 8 proteins analysed, despite the high sequence homology between many of them. The number of representative proteins called by the clustering algorithm is shown in Table X and full lists of the associated accession codes are available in Supplementary X.
Confusion matrices made between known origin host and those predicted by the stack models were used to assess model predictive strength and find evidence of possible host shifts. Counter-intuitively, a lower model accuracy implied greater evidence of adaptation to the host species; the mislabelling of proteins is indicative of host-specific adaptations to non-native hosts. For example, should a human-origin virus be incorrectly labelled as swine-origin by the model, this supports the hypothesis that the protein retains some signal of adaptation to swine hosts. Overall, confusion matrices were evaluated based on accuracy and F1 statistics (shown in Table Y).
| Protein | F1.micro | F1.macro | AUC |
|---|---|---|---|
| PB2 | 0.8431141 | 0.5176765 | 0.570 |
| PB1 | 0.8074284 | 0.3840304 | 0.527 |
| PA | 0.8750299 | 0.5749008 | 0.575 |
| HA | 0.7699953 | 0.2947088 | 0.558 |
| NP | 0.8840125 | 0.4957010 | 0.556 |
| NA | 0.7349682 | 0.3321250 | 0.527 |
| M1 | 0.8632233 | 0.5168716 | 0.559 |
| NS1 | 0.9098030 | 0.5188600 | 0.565 |
| Protein | multi-class | two-class |
|---|---|---|
| 01PB2 | 84.31% | 95.58% |
| 02PB1 | 80.74% | 92.78% |
| 03PA | 87.50% | 95.96% |
| 04HA | 77.00% | 96.95% |
| 05NP | 88.40% | 97.83% |
| 06NA | 73.50% | 87.57% |
| 07M1 | 86.32% | 96.87% |
| 08NS1 | 90.98% | 97.77% |
| Protein | Matched | Mismatched |
|---|---|---|
| 01PB2 | 404 | 145 |
| 02PB1 | 375 | 179 |
| 03PA | 573 | 118 |
| 04HA | 1760 | 240 |
| 05NP | 219 | 173 |
| 06NA | 316 | 76 |
| 07M1 | 145 | 76 |
| 08NS1 | 1008 | 90 |
ML trees Trees were first estimated with IQTree2
## [1] 1674
## [1] 1099
M1
“RecNode…S.1….P.A.” “RecNode…S.1….P.B.” [307] “RecNode…S.1….P.D.” “RecNode…S.1….P.E.” “RecNode…S.1….P.H.” [310] “RecNode…S.1….P.P.”
[322] “RecNode…S.2….P.A.” “RecNode…S.2….P.B.” “RecNode…S.2….P.D.” [325] “RecNode…S.2….P.E.” “RecNode…S.2….P.H.” “RecNode…S.2….P.P.”
“RecNode…S.3….P.A.” [340] “RecNode…S.3….P.B.” “RecNode…S.3….P.D.” “RecNode…S.3….P.E.” [343] “RecNode…S.3….P.H.” “RecNode…S.3….P.P.”
“RecNode…S.4….P.A.” “RecNode…S.4….P.B.” [358] “RecNode…S.4….P.D.” “RecNode…S.4….P.E.” “RecNode…S.4….P.H.” [361] “RecNode…S.4….P.P.”
[373] “RecNode…S.5….P.A.” “RecNode…S.5….P.B.” “RecNode…S.5….P.D.” [376] “RecNode…S.5….P.E.” “RecNode…S.5….P.H.” “RecNode…S.5….P.P.”
“RecNode…S.6….P.A.” [391] “RecNode…S.6….P.B.” “RecNode…S.6….P.D.” “RecNode…S.6….P.E.” [394] “RecNode…S.6….P.H.” “RecNode…S.6….P.P.” “RecNode…S.7….P…”
“RecNode…S.7….P.A.” “RecNode…S.7….P.B.” [409] “RecNode…S.7….P.D.” “RecNode…S.7….P.E.” “RecNode…S.7….P.H.” [412] “RecNode…S.7….P.P.”
[424] “RecNode…S.8….P.A.” “RecNode…S.8….P.B.” “RecNode…S.8….P.D.” [427] “RecNode…S.8….P.E.” “RecNode…S.8….P.H.” “RecNode…S.8….P.P.” “RecNode…S.9….P.A.” [442] “RecNode…S.9….P.B.” “RecNode…S.9….P.D.” “RecNode…S.9….P.E.” [445] “RecNode…S.9….P.H.” “RecNode…S.9….P.P.” “RecNode…S.10….P.A.” “RecNode…S.10….P.B.” [460] “RecNode…S.10….P.D.” “RecNode…S.10….P.E.” “RecNode…S.10….P.H.” [463] “RecNode…S.10….P.P.”